Policy Evaluation in Continuous MDPs With Efficient Kernelized Gradient Temporal Difference
Authors
Abstract
We consider policy evaluation in infinite-horizon discounted Markov decision problems with continuous compact state and action spaces. We reformulate this task as a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach the problem via a new functional generalization of quasi-gradient methods operating in tandem with sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. We call this method parsimonious kernel gradient temporal difference learning. Our main contribution is a memory-efficient nonparametric method guaranteed to converge exactly to the Bellman fixed point with probability 1 under attenuating step-sizes, provided the true value function belongs to the RKHS. Further, with a constant compression budget, we establish mean convergence to a neighborhood of the fixed point and show that the value function estimates have finite complexity. In the Mountain Car domain, we observe faster convergence to lower-error solutions than existing approaches with a fraction of the required memory.
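To make the kind of estimator the abstract describes more concrete, the Python sketch below represents the value function as a weighted sum of kernel evaluations at retained states, applies a stochastic TD-style update, and keeps memory bounded by pruning. Everything here is an illustrative assumption: the Gaussian kernel, the coefficient-threshold pruning rule, and the names (KernelTD, eps) are placeholders, and the paper's actual method uses a gradient temporal difference correction and sparse subspace projections rather than this simple threshold.

```python
# Minimal sketch of a kernel-based TD(0)-style estimator with crude
# dictionary pruning; NOT the paper's PKGTD algorithm, just the general
# shape of an RKHS value estimate with bounded memory.
import numpy as np

def gaussian_kernel(x, y, bandwidth=0.5):
    """RBF kernel k(x, y) used to parameterize the value estimate."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))

class KernelTD:
    def __init__(self, gamma=0.99, step=0.1, eps=1e-2):
        self.gamma, self.step, self.eps = gamma, step, eps
        self.dictionary = []   # retained states (kernel centers)
        self.weights = []      # coefficients of the kernel expansion

    def value(self, s):
        """V(s) = sum_i w_i k(d_i, s), the current RKHS-style estimate."""
        return sum(w * gaussian_kernel(d, s)
                   for d, w in zip(self.dictionary, self.weights))

    def update(self, s, reward, s_next):
        # Temporal-difference error for the observed transition.
        delta = reward + self.gamma * self.value(s_next) - self.value(s)
        # Functional stochastic step: add a new kernel center at s with a
        # coefficient proportional to the TD error.
        self.dictionary.append(np.asarray(s, dtype=float))
        self.weights.append(self.step * delta)
        self._compress()

    def _compress(self):
        # Crude stand-in for the paper's sparse subspace projection:
        # drop centers whose coefficients fall below a budget threshold.
        kept = [(d, w) for d, w in zip(self.dictionary, self.weights)
                if abs(w) > self.eps]
        self.dictionary = [d for d, _ in kept]
        self.weights = [w for _, w in kept]
```

The pruning step is what keeps the representation from growing with every observed transition, which is the memory concern the abstract refers to; the paper's projection-based compression plays that role with convergence guarantees.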
Similar resources
Gradient Temporal Difference Networks
Temporal-difference (TD) networks (Sutton and Tanner, 2004) are a predictive representation of state in which each node is an answer to a question about future observations or questions. Unfortunately, existing algorithms for learning TD networks are known to diverge, even in very simple problems. In this paper we present the first sound learning rule for TD networks. Our approach is to develop...
Policy gradient in continuous time
Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is c...
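As a rough illustration of the policy-gradient idea summarized above, estimating the sensitivity of a performance measure to policy parameters, the following Python sketch uses central finite differences over rollouts. The rollout_return function, parameter vector, and step sizes are hypothetical placeholders; this is not the cited paper's continuous-time estimator.

```python
# Hypothetical finite-difference estimate of the policy gradient dJ/dtheta.
# `rollout_return(theta)` is a placeholder that runs the parameterized
# policy and returns a sampled return.
import numpy as np

def policy_gradient_fd(rollout_return, theta, h=1e-3, n_rollouts=20):
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        # Average several rollouts to reduce Monte Carlo noise.
        j_plus = np.mean([rollout_return(theta + e) for _ in range(n_rollouts)])
        j_minus = np.mean([rollout_return(theta - e) for _ in range(n_rollouts)])
        grad[i] = (j_plus - j_minus) / (2 * h)
    return grad
```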
Accelerated Gradient Temporal Difference Learning
The family of temporal difference (TD) methods span a spectrum from computationally frugal linear methods like TD(λ) to data efficient least squares methods. Least square methods make the best use of available data directly computing the TD solution and thus do not require tuning a typically highly sensitive learning rate parameter, but require quadratic computation and storage. Recent algorith...
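The tradeoff this snippet describes can be made concrete with linear value features phi(s): an incremental TD(0) update costs O(d) work per sample but needs a hand-tuned step-size, while LSTD solves for the TD fixed point directly at O(d^2) storage and compute. The sketch below is a hedged illustration with placeholder feature maps and transition data, not the accelerated algorithms the cited paper proposes.

```python
# Contrast of incremental TD(0) and batch LSTD under linear features.
import numpy as np

def td0(transitions, phi, d, gamma=0.99, alpha=0.05):
    """O(d) per-sample update; sensitive to the learning rate alpha."""
    w = np.zeros(d)
    for s, r, s_next in transitions:
        delta = r + gamma * phi(s_next) @ w - phi(s) @ w
        w += alpha * delta * phi(s)
    return w

def lstd(transitions, phi, d, gamma=0.99, reg=1e-6):
    """Direct solve of A w = b; O(d^2) storage, no step-size to tune."""
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        A += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b += r * phi(s)
    return np.linalg.solve(A, b)
```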
Nonlinear Policy Gradient Algorithms for Noise-Action MDPs
We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalize Linearly Solvable MDPs (LMDPs). For finite horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state of the art policy optimization algorith...
Journal
Journal title: IEEE Transactions on Automatic Control
Year: 2021
ISSN: 0018-9286, 1558-2523, 2334-3303
DOI: https://doi.org/10.1109/tac.2020.3029315